A Galician Textual Corpus for Morphosyntactic Tagging with Application to Text-to-Speech Synthesis

Authors

Lorena Seijo Pereiro

Ana Martínez Ínsua

Francisco Méndez Pazó

Francisco Campillo Díaz

Eduardo Rodríguez Banga

Abstract

This paper presents the morphosyntactic tagger and the corpus of contemporary written Galician that are being used to develop the Galician version of our text-to-speech synthesizer. Their quality and accuracy make them useful for speech technology applications and potential references for further research on the Galician language. In essence, the tagger automatically assigns morphosyntactic categories and other additional labels to the words in the corpus by combining a small (although highly reliable) set of rules with a stochastic language model based on class n-grams whose probabilities are trained on the corpus itself. The texts in the corpus are tagged with a bootstrapping technique: a small amount of text is first tagged automatically using a reduced set of linguistic rules, the tagging is manually revised, and an initial statistical model is built from the results. The tagging process then consists essentially of a series of consecutive stages, each comprising automatic tagging with the latest version of the statistical model, manual revision, and an update of the stochastic model with the correctly tagged text.
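The bootstrapping cycle described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes a tag-bigram model (class n-grams with n = 2), add-one smoothing, and Viterbi decoding, and the names `BigramTagger`, `bootstrap`, `rule_tagger`, `revise`, and the tag set `DET`/`NOUN`/`VERB` are all hypothetical. The `revise` function stands in for the manual revision step, which in practice is done by a human annotator.

```python
import math
from collections import Counter, defaultdict


class BigramTagger:
    """Toy class-bigram tagger: P(tag | previous tag) * P(word | tag)."""

    def __init__(self):
        self.transitions = defaultdict(Counter)  # previous tag -> next-tag counts
        self.emissions = defaultdict(Counter)    # tag -> word counts
        self.tags = set()
        self.vocab = set()

    def update(self, tagged_sentences):
        """Add correctly tagged (revised) sentences to the counts."""
        for sentence in tagged_sentences:
            prev = "<s>"
            for word, tag in sentence:
                self.transitions[prev][tag] += 1
                self.emissions[tag][word.lower()] += 1
                self.tags.add(tag)
                self.vocab.add(word.lower())
                prev = tag

    @staticmethod
    def _logp(counter, key, outcomes):
        # Add-one smoothing over `outcomes` possible events.
        return math.log((counter[key] + 1) / (sum(counter.values()) + outcomes))

    def tag(self, words):
        """Viterbi search for the most probable tag sequence."""
        tags = sorted(self.tags)
        if not words or not tags:
            return []
        n_tags = len(tags)
        n_words = len(self.vocab) + 1  # one extra slot for unknown words
        best = {}
        for t in tags:
            score = (self._logp(self.transitions["<s>"], t, n_tags)
                     + self._logp(self.emissions[t], words[0].lower(), n_words))
            best[t] = (score, [t])
        for word in words[1:]:
            new_best = {}
            for t in tags:
                emit = self._logp(self.emissions[t], word.lower(), n_words)
                score, path = max(
                    (best[p][0] + self._logp(self.transitions[p], t, n_tags), best[p][1])
                    for p in tags
                )
                new_best[t] = (score + emit, path + [t])
            best = new_best
        return list(zip(words, max(best.values())[1]))


def bootstrap(batches, rule_tagger, revise):
    """Rules tag the first batch; the current model tags every later batch."""
    model = BigramTagger()
    for i, batch in enumerate(batches):
        tagger = rule_tagger if i == 0 else model.tag
        draft = [tagger(words) for words in batch]
        model.update(revise(draft))  # only revised text updates the model
    return model


# Hypothetical toy data: a gold lexicon plays the role of the human reviser.
GOLD = {"o": "DET", "a": "DET", "can": "NOUN", "nena": "NOUN",
        "ladra": "VERB", "canta": "VERB"}


def rule_tagger(words):
    # Hypothetical rule: articles are DET, everything else defaults to NOUN.
    return [(w, "DET" if w in {"o", "a"} else "NOUN") for w in words]


def revise(draft):
    # Stand-in for manual revision: replace each draft tag with the gold tag.
    return [[(w, GOLD[w]) for w, _ in sentence] for sentence in draft]


batches = [[["o", "can", "ladra"]], [["a", "nena", "canta"]]]
model = bootstrap(batches, rule_tagger, revise)
print(model.tag(["a", "nena", "ladra"]))
```

The point of the sketch is the control flow: the rule-based tagger is used only to seed the first batch, and every later batch is tagged by the most recent statistical model before being corrected and fed back into the counts.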


Similar resources

Morphosyntactic Tagging of Slovene Legal Language

Part-of-speech tagging or, more accurately, morphosyntactic tagging, is a procedure that assigns to each word token appearing in a text its morphosyntactic description, e.g. “masculine singular common noun in the genitive case”. Morphosyntactic tagging is an important component of many language technology applications, such as machine translation, speech synthesis, or information extraction. In...


Análisis morfosintáctico estadístico en lengua gallega

This paper describes a morphosyntactic analyser for Galician which, apart from its obvious linguistic interest, can be easily applied to speech recognition and speech synthesis systems. While rule-driven models produce better performance, stochastic models have shown comparable accuracy when properly designed. Moreover, rule-driven models are based on a complex set of linguistic rules, qui...


Lemmatization and Morphosyntactic Tagging of Croatian and Serbian

We investigate state-of-the-art statistical models for lemmatization and morphosyntactic tagging of Croatian and Serbian. The models stem from a new manually annotated SETIMES.HR corpus of Croatian, based on the SETimes parallel corpus. We train models on Croatian text and evaluate them on samples of Croatian and Serbian from the SETimes corpus and the two Wikipedias. Lemmatization accuracy for...


Enhanced CORILGA: Introducing the Automatic Phonetic Alignment Tool for Continuous Speech

The Corpus Oral Informatizado da Lingua Galega (CORILGA) project aims at building a corpus of oral language for Galician, primarily designed to study the linguistic variation and change. This project is currently under development and it is periodically enriched with new contributions. The long-term goal is that all the speech recordings will be enriched with phonetic, syllabic, morphosyntactic...


Multi-source morphosyntactic tagging for spoken Rusyn

This paper deals with the development of morphosyntactic taggers for spoken varieties of the Slavic minority language Rusyn. As neither annotated corpora nor parallel corpora are electronically available for Rusyn, we propose to combine existing resources from the etymologically close Slavic languages Russian, Ukrainian, Slovak, and Polish and adapt them to Rusyn. Using MarMoT as tagging toolki...




Publication date: 2004
